Section: New Results
Parallel optimization methods revisited for multi-core and many-core (co)processors
Participants: J. Gmys and N. Melab
This contribution is joint work with M. Mezmaz, E. Alekseeva and D. Tuyttens from the University of Mons (UMONS), and with T. C. Pessoa and F. H. De Carvalho Junior from Universidade Federal do Ceará (UFC), Brazil.
On the road to exascale, coprocessors are increasingly becoming key building blocks of High Performance Computing platforms. In addition to their energy efficiency, these many-core devices boost the performance of multi-core processors. During 2016, we first revisited the design and implementation of parallel Branch-and-Bound (B&B) algorithms using the work stealing paradigm on GPU accelerators [16][40], multi-GPU systems [17], multi-core processors [15] and MIC (Xeon Phi) coprocessors [20]. The challenge is to take into account both the highly irregular nature of the B&B algorithm and the hardware characteristics of GPU, Xeon Phi and multi-core (co)processors. Several work stealing strategies have been investigated while addressing the following issues: host-device data transfer, thread divergence and data placement in the memory hierarchy on GPU, and vectorization on Xeon Phi. The proposed approaches have been extensively evaluated on permutation-based optimization problems (e.g. the Flow-Shop Scheduling Problem, FSP). The results reported in the cited papers demonstrate the efficiency of the many-core approaches compared to their multi-core counterparts. An extension of the proposed approaches to large hybrid clusters including multi-core and many-core (co)processors has already been started in [27].
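To make the work stealing idea concrete, the following is a minimal sketch of a permutation B&B for FSP in which idle workers steal nodes from the most loaded pool. It is an illustration under stated assumptions, not the implementation from the cited papers: the workers are simulated sequentially in round-robin order rather than run on real threads or GPU blocks, the bound is a deliberately simple one (work remaining on the last machine), and all names (`machine_fronts`, `lower_bound`, `branch_and_bound`, `brute_force`) are hypothetical.

```python
from collections import deque
from itertools import permutations


def machine_fronts(proc, prefix):
    """Completion time on each machine after scheduling the jobs in `prefix`."""
    m = len(proc[0])
    front = [0] * m
    for job in prefix:
        front[0] += proc[job][0]
        for k in range(1, m):
            front[k] = max(front[k], front[k - 1]) + proc[job][k]
    return front


def lower_bound(proc, prefix, remaining):
    """Assumed simple bound: last-machine front plus remaining last-machine work."""
    return machine_fronts(proc, prefix)[-1] + sum(proc[j][-1] for j in remaining)


def branch_and_bound(proc, n_workers=4):
    """Sequentially simulated work stealing B&B; returns (cost, permutation, steals)."""
    pools = [deque() for _ in range(n_workers)]
    pools[0].append((0, (), frozenset(range(len(proc)))))  # root node
    best_cost, best_perm, steals = float("inf"), None, 0
    while any(pools):
        for pool in pools:
            if not pool:
                # Idle worker: steal the oldest node of the most loaded pool.
                victim = max(pools, key=len)
                if not victim:
                    continue  # nothing left to steal anywhere
                pool.append(victim.popleft())
                steals += 1
            lb, prefix, remaining = pool.pop()  # owner works depth-first
            if lb >= best_cost:
                continue  # pruned by the current best solution
            if not remaining:
                best_cost, best_perm = lb, prefix  # leaf: bound equals makespan
                continue
            for job in sorted(remaining):
                child, rest = prefix + (job,), remaining - {job}
                child_lb = lower_bound(proc, child, rest)
                if child_lb < best_cost:
                    pool.append((child_lb, child, rest))
    return best_cost, best_perm, steals


def brute_force(proc):
    """Exhaustive check, only usable on tiny instances."""
    return min(machine_fronts(proc, p)[-1] for p in permutations(range(len(proc))))
```

Here `proc[j][k]` is assumed to hold the processing time of job `j` on machine `k`. Owners pop from the back of their pool while thieves take from the front, so a stolen node tends to be a shallow one that carries a large subtree; this is one common choice among the many stealing strategies that can be considered.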
The second part of the contribution consists in proposing a new hyper-heuristic (generalized GRASP) together with its parallelization for multi-core processors [11]. A cost function based on a bounding operator (used in B&B) is integrated into GRASP for the first time. Multi-core computing is used to investigate 315 GRASP configurations. In order to improve the performance of the local search procedure used in GRASP, we have proposed in [33] an original vectorization of the makespan cost function of FSP on Xeon Phi coprocessors. The reported results show that speed-ups of up to 4.5 can be achieved compared to a non-vectorized approach.
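As an illustration of how a bound-based cost function can drive GRASP, the sketch below builds a permutation greedily with a restricted candidate list (RCL) scored by a simple bound, then improves it by a swap-based local search on the makespan. It is a sketch under assumptions, not the generalized GRASP of [11] nor the vectorized cost function of [33]: the bound, the swap neighborhood, the parameter `alpha` and all function names are hypothetical choices, and the makespan is evaluated with the plain scalar recurrence.

```python
import random


def makespan(proc, perm):
    """Scalar makespan recurrence: proc[j][k] is the time of job j on machine k."""
    m = len(proc[0])
    front = [0] * m
    for job in perm:
        front[0] += proc[job][0]
        for k in range(1, m):
            front[k] = max(front[k], front[k - 1]) + proc[job][k]
    return front[-1]


def bound_cost(proc, prefix, remaining):
    """Assumed greedy cost: partial makespan plus remaining last-machine work."""
    return makespan(proc, prefix) + sum(proc[j][-1] for j in remaining)


def construct(proc, alpha, rng):
    """Greedy randomized construction using a value-based RCL."""
    remaining, perm = set(range(len(proc))), []
    while remaining:
        scored = sorted(
            (bound_cost(proc, perm + [j], remaining - {j}), j) for j in remaining
        )
        lo, hi = scored[0][0], scored[-1][0]
        threshold = lo + alpha * (hi - lo)  # alpha=0 is greedy, alpha=1 is random
        rcl = [j for cost, j in scored if cost <= threshold]
        job = rng.choice(rcl)
        perm.append(job)
        remaining.remove(job)
    return perm


def local_search(proc, perm):
    """Swap neighborhood, accepting improving moves until none is left."""
    best, improved = makespan(proc, perm), True
    while improved:
        improved = False
        for i in range(len(perm) - 1):
            for j in range(i + 1, len(perm)):
                perm[i], perm[j] = perm[j], perm[i]
                cost = makespan(proc, perm)
                if cost < best:
                    best, improved = cost, True
                else:
                    perm[i], perm[j] = perm[j], perm[i]  # undo the swap
    return perm, best


def grasp(proc, iterations=50, alpha=0.3, seed=0):
    """Repeat construction and local search; return (best permutation, cost)."""
    rng = random.Random(seed)
    best_perm, best_cost = None, float("inf")
    for _ in range(iterations):
        perm, cost = local_search(proc, construct(proc, alpha, rng))
        if cost < best_cost:
            best_perm, best_cost = list(perm), cost
    return best_perm, best_cost
```

In such a scheme, a configuration would correspond to one choice of parameters such as `alpha`, the greedy cost and the neighborhood; independent configurations can be evaluated on separate cores. Nearly all of the run time is spent in `makespan`, called once per candidate swap, which is why the local search benefits from a faster cost function.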